282 research outputs found

    A Compact Index for Order-Preserving Pattern Matching

    Order-preserving pattern matching was introduced recently but it has already attracted much attention. Given a reference sequence and a pattern, we want to locate all substrings of the reference sequence whose elements have the same relative order as the pattern elements. For this problem we consider the offline version in which we build an index for the reference sequence so that subsequent searches can be completed very efficiently. We propose a space-efficient index that works well in practice despite its lack of good worst-case time bounds. Our solution is based on the new approach of decomposing the indexed sequence into an order component, containing ordering information, and a delta component, containing information on the absolute values. Experiments show that this approach is viable, faster than the available alternatives, and it is the first one offering simultaneously small space usage and fast retrieval. Comment: 16 pages. A preliminary version appeared in the Proc. IEEE Data Compression Conference, DCC 2017, Snowbird, UT, USA, 201
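
    The matching condition described in this abstract can be illustrated with a minimal brute-force sketch. This is not the index proposed in the paper; it only checks the definition (same relative order) window by window. The function names `rank_signature` and `op_match` are hypothetical, and equal values are assumed to share a rank.

```python
def rank_signature(seq):
    """Map each element to the rank of its value within seq (ties share a rank)."""
    rank = {v: i for i, v in enumerate(sorted(set(seq)))}
    return tuple(rank[v] for v in seq)


def op_match(text, pattern):
    """Return all start positions where a window of text has the same
    relative order as pattern. Naive scan: re-ranks every window."""
    m = len(pattern)
    target = rank_signature(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if rank_signature(text[i:i + m]) == target
    ]


if __name__ == "__main__":
    text = [10, 22, 15, 30, 20, 18, 27]
    pattern = [1, 3, 2]  # shape: low, high, middle
    print(op_match(text, pattern))
```

    For the sample data, the windows [10, 22, 15] and [15, 30, 20] share the low, high, middle shape of the pattern, so positions 0 and 2 are reported.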

    Compressed Spaced Suffix Arrays

    Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data structure, either a hash table or a spaced suffix array (SSA). In this paper we show how to compress SSAs relative to normal suffix arrays (SAs) and still support fast random access to them. We first prove a theoretical upper bound on the space needed to store an SSA when we already have the SA. We then present experiments indicating that our approach works even better in practice

    Wheeler Graphs: Variations on a Theme by Burrows and Wheeler


    On the ordering of sparse linear systems

    In this paper we consider the algorithms for transforming an n × n sparse matrix A into another matrix B such that Gaussian elimination applied to B takes time asymptotically less than n³. These algorithms take the sparse matrix A as input, and return a pair of permutation matrices P, Q such that B = PAQ has a small bandwidth, or some other desirable form. We study the average effectiveness of these algorithms by using random matrices with Θ(n) nonzero elements. We prove that with high probability these algorithms cannot produce a reduction of the asymptotic cost of the standard Gaussian elimination algorithm. We also study the effectiveness of these algorithms for ordering very sparse matrices. We show that there exist matrices with 3n nonzeros for which reordering rows and columns does not reduce the asymptotic cost of Gaussian elimination. We also prove that each matrix with at most two nonzeros in each row and in each column, can be transformed into a banded matrix with bandwidth five

    XBWT Tricks

    The eXtended Burrows-Wheeler Transform (XBWT) is a data transformation introduced in [Ferragina et al., FOCS 2005] to compactly represent a labeled tree and simultaneously support navigation and path-search operations over its label structure. A natural application of the XBWT is to store a dictionary of strings. A recent extensive experimental study [Martínez-Prieto et al., Information Systems, 2016] shows that, among the available string dictionary implementations, the XBWT is attractive because of its good tradeoff between small space usage, speed, and support for substring searches. In this paper we further investigate the use of the XBWT for storing a string dictionary. Our first contribution is to show how to add suffix links (aka failure links) to an XBWT string dictionary. For an XBWT dictionary with n internal nodes our suffix links can be traversed in constant time and only take 2n + o(n) bits of space. Our second contribution is a set of practical construction algorithms for the XBWT, including the additional data structure supporting the traversal of suffix links. Our algorithms build on the many well engineered algorithms for Suffix Array and BWT construction and offer different tradeoffs between running time and working space

    On Computing the Entropy of Cellular Automata

    We study the topological entropy of a particular class of dynamical systems: cellular automata. The topological entropy of a dynamical system (X,F) is a measure of the complexity of the dynamics of F over the space X. The problem of computing (or even approximating) the topological entropy of a given cellular automaton is algorithmically undecidable (Ergodic Theory Dynamical Systems 12 (1992) 255). In this paper, we show how to compute the entropy of two important classes of cellular automata, namely linear and positively expansive cellular automata. In particular, we prove a closed formula for the topological entropy of D-dimensional (D ≥ 1) linear cellular automata over the ring and we provide an algorithm for computing the topological entropy of positively expansive cellular automata

    Multiple seeds sensitivity using a single seed with threshold

    Spaced seeds are a fundamental tool for similarity search in biosequences. The best sensitivity/selectivity trade-offs are obtained using many seeds simultaneously: This is known as the multiple seed approach. Unfortunately, spaced seeds use a large amount of memory and the available RAM is a practical limit to the number of seeds one can use simultaneously. Inspired by some recent results on lossless seeds, we revisit the approach of using a single spaced seed and considering two regions homologous if the seed hits in at least t sufficiently close positions. We show that by choosing the locations of the don't care symbols in the seed using quadratic residues modulo a prime number, we derive single seeds that when used with a threshold t > 1 have competitive sensitivity/selectivity trade-offs, indeed close to the best multiple seeds known in the literature. In addition, the choice of the threshold t can be adjusted to modify sensitivity and selectivity a posteriori, thus enabling a more accurate search in the specific instance at issue. The seeds we propose also exhibit robustness and allow flexibility in usage
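
    The idea of a single seed used with a hit threshold can be sketched as follows. The abstract does not give the exact construction, so everything here is an assumption made for illustration: the seed has length p - 1, position i (1-based) is a "care" symbol when i is a quadratic residue modulo p and a don't-care symbol otherwise, and "sufficiently close" is modelled by a hypothetical `max_gap` parameter. All function names are hypothetical as well.

```python
def quadratic_residues(p):
    """Nonzero quadratic residues modulo a prime p."""
    return {(x * x) % p for x in range(1, p)}


def qr_seed(p):
    """ASSUMED construction: '1' (care) at 1-based positions that are
    quadratic residues mod p, '0' (don't care) elsewhere; length p - 1."""
    qr = quadratic_residues(p)
    return "".join("1" if i in qr else "0" for i in range(1, p))


def seed_hits(seed, a, b):
    """Start offsets where a and b agree on every care position of the seed."""
    care = [k for k, c in enumerate(seed) if c == "1"]
    n = min(len(a), len(b))
    return [
        i
        for i in range(n - len(seed) + 1)
        if all(a[i + k] == b[i + k] for k in care)
    ]


def homologous(seed, a, b, t, max_gap):
    """True if at least t hits fall within a span of max_gap offsets
    (max_gap is an illustrative stand-in for 'sufficiently close')."""
    hits = seed_hits(seed, a, b)
    return any(
        hits[j + t - 1] - hits[j] <= max_gap
        for j in range(len(hits) - t + 1)
    )


if __name__ == "__main__":
    seed = qr_seed(13)
    a = "ACGTACGTACGTACGT"
    b = a[:5] + "G" + a[6:]  # one substitution at index 5
    print(seed, seed_hits(seed, a, b), homologous(seed, a, b, t=2, max_gap=1))
```

    Raising t or lowering `max_gap` makes the criterion stricter without rebuilding the seed, which is the kind of a posteriori adjustment the abstract mentions.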